Efficient Packing of Patterns in Sparse Distributed Memory by
Selective Weighting of Input Bits*

Pentti KANERVA

Abstract. When a set of patterns is stored in a distributed memory, any given storage
location participates in the storage of many patterns. From the perspective of
any one stored pattern, the other patterns act as noise, and such noise limits the
memory’s storage capacity. The more similar the retrieval cues for two patterns are,
the more the patterns interfere with each other in memory, and the harder it is to
separate them on retrieval. This paper describes a method of weighting the retrieval
cues to reduce such interference and thus to improve the separability of patterns
that have similar cues.


1. INTRODUCTION 

The mathematical theory of sparse distributed memory [3] was developed assuming 
that both the addresses of the storage locations, or hard addresses, and the stored pat- 
terns and their retrieval cues are a uniform random sample of all possible n-bit pat- 
terns. In the terminology of artificial neural nets, assuming a uniform random 
distribution of the hard addresses is equivalent to saying that the input coefficients of 
the hidden layer of a fully connected feed-forward net are set randomly to -1s and 1s
with equal probability. This choice of coefficients works well when the data fill the 
space more or less uniformly. Unfortunately, natural data, such as the bit patterns 
derived from spoken words, never do. Instead, they cluster, and much of the space 
remains unoccupied. The result is that the parts of the memory where the clusters fall 
are overutilized, with stored patterns interfering with each other heavily, while the 
rest of the memory is underutilized. 

One approach to utilizing the memory more efficiently is to fit the hard addresses to 
the data. An extreme case of this is the Hamming network, which has one storage 
location for each item of the data set, with the retrieval cue for that item as its 
address, and the item is stored in that location only. In distributed memories, the hard 
addresses have been chosen in various ways: Joglekar [2] has used patterns from the 
data set itself as hard addresses in experiments with NETtalk data, Danforth [1] has 
used patterns from speech at large as hard addresses in experiments with spoken- 
digit recognition, Rogers [4] has used genetic algorithms to arrive at a set of hard 
addresses in experiments with weather data, and Saarinen et al. [5] have used Koho- 
nen’s self-organization algorithm to arrive at a set of hard addresses in experiments 
with raster pictures of digits. The back-propagation of error, to modify the input coeffi- 
cients of the hidden layer, accomplishes a similar thing, the difference being that the 
back-propagation architectures typically have few hidden units, whereas sparse dis- 
tributed memories have many. 

2. THE PROBLEM 

A complementary approach is discussed in this paper: Assuming that the hard
addresses of a sparse distributed memory are fixed and nonoptimal for the data, how
should the memory be addressed so that patterns associated with similar cues (and
stored patterns at large) interfere with each other as little as possible?

* To appear in O. Simula (ed.), Proceedings of the International Conference on Artificial
Neural Networks (ICANN-91, Espoo, Finland) (Amsterdam: Elsevier).

The purpose of addressing a sparse distributed memory is to select a small set of
storage locations, referred to as the active locations, in which an item is stored or from
which it is retrieved. The active locations for an item are selected based on the distance
between the item’s retrieval cue and the hard addresses: The locations closest to
the cue are activated.

The standard algorithm for addressing uses Hamming distance, which means that 
all bits are weighted equally. Consequently, the sets activated by two very similar cues 
have a large overlap, and when one of them is used to retrieve its associated pattern, 
the pattern associated with the other is mixed with it heavily (in proportion to the size 
of the overlap) and makes the output ambiguous. 

Here the standard addressing algorithm is modified by changing the computation of the
distance between a retrieval cue and a hard address. Instead of weighting all bits
equally, each bit of each cue is weighted individually, and the locations with the small- 
est weighted distance to the cue are activated. The problem then is to find a vector of 
nonnegative weights for each item in the data set — for each retrieval cue — such that 
the set of locations activated by any one cue overlaps as little as possible with the sets 
activated by the others. This is the sense in which we try to fit a given set of data as 
well as possible into a given set of storage locations. 

Expressed in symbols (italic for scalars, bold lowercase for vectors, bold uppercase
for matrices), the problem is the following: We are given a data set of $t$ pairs of
binary vectors $(\mathbf{x}_i, \mathbf{y}_i)$, $i = 1, 2, \ldots, t$, where the $\mathbf{x}_i$
are $t$ unique $n$-bit retrieval cues (the input patterns) and the $\mathbf{y}_i$ are their
associated $r$-bit patterns to be stored (the desired output patterns). We are also given
an $m \times n$ matrix $\mathbf{A}$ of bits, interpreted as $m$ hard addresses of a sparse
distributed memory ($m$ locations with $n$-bit addresses; $m$ hidden units with $n$ inputs
each).

The set of active locations for the input $\mathbf{x}_i$ is determined by computing the
Hamming distances between $\mathbf{x}_i$ and the rows of $\mathbf{A}$ and by selecting the
$k$ closest rows, where $k$ is a parameter of the model ($k$ is of order square root of $m$;
in the original sparse distributed memory model, distance below a threshold is used as the
activation criterion).

Because there can be several rows at the maximum activation distance, we in fact
select all the rows (and locations) that are no farther than the $k$th closest and indicate
their number by $k^+$ (the exact value of $k^+$ depends on $\mathbf{x}_i$ and $\mathbf{A}$
[and on $\mathbf{w}_i$; see below]).

For convenience, and without loss of generality, we let the two values of a binary
variable (a bit) be -1 and 1 when dealing with addresses and with retrieval cues.
Choosing the $k^+$ closest rows according to Hamming distance is then equivalent to
choosing the $k^+$ rows with the largest inner product with $\mathbf{x}_i$, so that the
$k^+$ largest components of the vector $\mathbf{A}\mathbf{x}_i$ ($= \mathbf{c}_i$) indicate
the active locations for the input $\mathbf{x}_i$. Designate this set by $\mathbf{a}_i$
($\mathbf{a}_i$ is an $m$-bit activation vector, with $k^+$ ones indicating the $k^+$ active
locations and zeros elsewhere; a component of $\mathbf{a}_i$ is 1 iff the corresponding
component of $\mathbf{c}_i$ is at least as large as the $k$th-largest component of
$\mathbf{c}_i$).
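
The selection rule can be illustrated with a short Python sketch. It is not part of the original experiments; the function names and the toy address matrix are assumptions made for the example. It keeps every row that ties with the k-th largest inner product, and it checks that, for n-bit vectors of -1s and 1s, the inner product equals n minus twice the Hamming distance, which is why the two selection criteria agree.

```python
def inner(a, b):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def hamming(a, b):
    """Number of coordinates at which a and b differ."""
    return sum(p != q for p, q in zip(a, b))


def activation(A, x, k):
    """Activation vector for cue x: 1 for every row of A whose inner
    product with x is at least the k-th largest one, 0 elsewhere.
    Ties at the threshold are all kept, so the number of ones (k+)
    can exceed k."""
    c = [inner(row, x) for row in A]
    threshold = sorted(c, reverse=True)[k - 1]
    return [1 if cu >= threshold else 0 for cu in c]


# Toy example (hypothetical data): m = 5 hard addresses of n = 4 bits.
A = [
    [1, 1, 1, 1],
    [1, 1, 1, -1],
    [1, -1, 1, -1],
    [-1, -1, -1, -1],
    [1, 1, -1, 1],
]
x = [1, 1, 1, 1]
a = activation(A, x, k=2)
print(a, "k+ =", sum(a))
```

In this toy case two rows tie for second place, so asking for k = 2 activates three locations.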

Weighting the bits of a retrieval cue individually can now be expressed as follows.
The weights for $\mathbf{x}_i$ are an $n$-dimensional vector of nonnegative real numbers.
Call it $\mathbf{w}_i$, and let $\mathbf{W}_i$ be the corresponding diagonal matrix. Then the
activation vector $\mathbf{a}_i$ will indicate the $k^+$ largest components of
$\mathbf{A}\mathbf{W}_i\mathbf{x}_i$ instead of $\mathbf{A}\mathbf{x}_i$. We can now state the
problem as finding $t$ weight vectors $\mathbf{w}_i$ such that the corresponding $t$
activation vectors $\mathbf{a}_i$ cover the space of all possible activation vectors (that
is, vectors of $k^+$ ones and $m - k^+$ zeros) as uniformly as possible.
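
A minimal Python sketch of weighted addressing follows, under the assumption that the weighted score of a location is the sum over bits of (address bit) x (weight) x (cue bit). The function name, the toy matrix, and the weight values are hypothetical and chosen only to show how a nonuniform weight vector changes the active set.

```python
def weighted_activation(A, x, w, k):
    """Activation vector from the k largest components of A W x, where
    W is the diagonal matrix of the weights w. Ties are all kept."""
    c = [sum(av * wv * xv for av, wv, xv in zip(row, w, x)) for row in A]
    threshold = sorted(c, reverse=True)[k - 1]
    return [1 if cu >= threshold else 0 for cu in c]


# Toy example (hypothetical data): m = 5 hard addresses of n = 4 bits.
A = [
    [1, 1, 1, 1],
    [1, 1, 1, -1],
    [1, -1, 1, -1],
    [-1, -1, -1, -1],
    [1, 1, -1, 1],
]
x = [1, 1, 1, 1]

uniform = weighted_activation(A, x, [1, 1, 1, 1], k=2)
# Weights that sum to n = 4 and emphasize the last bit.
emphasized = weighted_activation(A, x, [0.5, 0.5, 0.5, 2.5], k=2)
print(uniform, emphasized)
```

With uniform weights the sketch reduces to ordinary Hamming-distance addressing; with the last bit emphasized, the location whose address disagrees with the cue in that bit drops out of the active set.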

This statement does not specify the problem fully, because we are not saying how to 
maximize uniformity. Two possibilities come readily to mind, but there can be others. 
One is by minimizing the average overlap and the other by minimizing the maximum 
overlap over all pairs of activation vectors for the data set, where the overlap between 
two activation vectors is the number of ones in their logical AND (which is the number 
of hard locations activated by both of two input patterns). The two methods described 
in this paper are heuristic and do not necessarily achieve either of these minimiza- 
tions, but they do improve the utilization of the memory by spreading the total activity 
rather uniformly over all the storage locations. 
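
The two candidate measures of uniformity can be made concrete with a small Python sketch. The helper names and the three toy activation vectors are assumptions for illustration only.

```python
from itertools import combinations


def overlap(a, b):
    """Number of locations activated by both cues (ones in the AND)."""
    return sum(p & q for p, q in zip(a, b))


def mean_and_max_overlap(activations):
    """Average and maximum overlap over all pairs of activation vectors."""
    sizes = [overlap(a, b) for a, b in combinations(activations, 2)]
    return sum(sizes) / len(sizes), max(sizes)


# Three hypothetical activation vectors over m = 5 locations.
activations = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 1, 1, 0],
]
mean_overlap, max_overlap = mean_and_max_overlap(activations)
print(mean_overlap, max_overlap)
```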

Finally, how are the active locations used? The memory stores r-bit patterns in an 
m x r matrix C, which is initially set to zeros. A row of A and the corresponding row of 
C constitute a storage location (a hard location), with the row of A as its address and 
the row of $\mathbf{C}$ as its contents. The pattern $\mathbf{y}_i$ is stored by adding it to
the contents of each active location, and a pattern is retrieved by adding the contents of
the active locations and by thresholding the resulting $r$ sums. In symbols, storing
$(\mathbf{x}_i, \mathbf{y}_i)$ means adding the matrix $\mathbf{a}_i \mathbf{y}_i^T$ to
$\mathbf{C}$, and retrieving the pattern for $\mathbf{x}_i$ means thresholding the vector
$\mathbf{C}^T \mathbf{a}_i$, where $\mathbf{a}_i$ is the activation vector for
$\mathbf{x}_i$ and $T$ means transpose (these are
the standard outer-product or Hebbian learning rule and the corresponding output 
rule, respectively). 
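
The storage and retrieval rules can be sketched in Python as follows. The text does not state the threshold; the sketch assumes patterns of 0s and 1s and thresholds each sum at half the number of patterns written into the active locations, which is an assumption of this example, as are the function names and the per-location write counter.

```python
def store(C, counts, a, y):
    """Add pattern y to the contents of every active location
    (the outer-product rule: C += a y^T)."""
    for u, active in enumerate(a):
        if active:
            counts[u] += 1
            for v, bit in enumerate(y):
                C[u][v] += bit


def retrieve(C, counts, a):
    """Sum the contents of the active locations (C^T a) and threshold.
    Assumed threshold: half the number of patterns in those locations."""
    active = [u for u, flag in enumerate(a) if flag]
    sums = [sum(C[u][v] for u in active) for v in range(len(C[0]))]
    N = sum(counts[u] for u in active)
    return [1 if 2 * s > N else 0 for s in sums], sums


# Toy memory (hypothetical): m = 4 locations, r = 3 bits per pattern.
m, r = 4, 3
C = [[0] * r for _ in range(m)]
counts = [0] * m

a1, y1 = [1, 1, 0, 0], [1, 0, 1]
a2, y2 = [0, 1, 1, 0], [0, 1, 1]
store(C, counts, a1, y1)
store(C, counts, a2, y2)

out1, sums1 = retrieve(C, counts, a1)
out2, sums2 = retrieve(C, counts, a2)
print(out1, sums1, out2, sums2)
```

The two activation sets share one location, so each retrieved sum vector contains a contribution from the other pattern; here the overlap is small enough that both patterns are still recovered.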

3. PROPERTIES OF THE WEIGHT VECTOR FOR A RETRIEVAL CUE 

In the following, we use the terms weighted distance and new distance from $\mathbf{x}_i$
to $\mathbf{x}_j$ to mean the Hamming distance weighted with $\mathbf{w}_i$:
$d_i(\mathbf{x}_j) = \mathbf{w}_i \cdot (\mathbf{x}_j \neq \mathbf{x}_i)$, where
$(\mathbf{a} \neq \mathbf{b})$ is the vector with 1s where $\mathbf{a}$ and $\mathbf{b}$
differ, and 0s elsewhere. To make the weighted distances comparable with each other and
with the Hamming distance, we normalize the weights so that they add to $n$:
$\sum_v w_{iv} = n$ ($w_{iv} \ge 0$, $v = 1, 2, \ldots, n$). Notice that the weighted
distance from $\mathbf{x}_i$ to $\mathbf{x}_j$ is usually different from the weighted
distance from $\mathbf{x}_j$ to $\mathbf{x}_i$, and also that the activation vectors are not
affected by the normalizing of the weights.
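
As a small illustration, the Python sketch below computes the weighted distance and the normalization, and shows the asymmetry that arises when two cues carry different weight vectors. The function names and the example vectors are hypothetical.

```python
def differ(a, b):
    """Vector with 1s where a and b differ, and 0s elsewhere."""
    return [1 if p != q else 0 for p, q in zip(a, b)]


def normalize(w):
    """Scale nonnegative weights so that they add up to n = len(w)."""
    n, total = len(w), sum(w)
    return [n * wv / total for wv in w]


def weighted_distance(w_i, x_i, x_j):
    """Distance from x_i to x_j, weighted with the weights of x_i."""
    return sum(wv * dv for wv, dv in zip(w_i, differ(x_j, x_i)))


x_i = [1, 1, 1, 1]
x_j = [1, 1, 1, -1]                 # differs from x_i in the last bit only
w_i = normalize([1, 1, 1, 5])       # x_i emphasizes the last bit
w_j = normalize([1, 1, 1, 1])       # x_j uses uniform weights

d_ij = weighted_distance(w_i, x_i, x_j)
d_ji = weighted_distance(w_j, x_j, x_i)
print(w_i, d_ij, d_ji)
```

With uniform weights the weighted distance equals the Hamming distance (1 bit here), whereas the emphasized weights of the other cue stretch the same one-bit difference to 2.5.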

How are the weights for $\mathbf{x}_i$ affected by the other retrieval cues in the data
set? If $\mathbf{x}_j$ is far from $\mathbf{x}_i$ (at about $n/2$ bits or further), their
activation vectors based on uniform weighting are well separated, so that $\mathbf{x}_j$
does not have a great influence on the weights for $\mathbf{x}_i$. We merely need to make
sure that the new distance to $\mathbf{x}_j$ is not too small. If $\mathbf{x}_j$ is near
$\mathbf{x}_i$, their initial activation vectors (those based on uniform weighting) have a
large overlap, and the weights need to be chosen to decrease it. Therefore, the new
distance to $\mathbf{x}_j$ needs to be much larger than the old. The new distance is
influenced only by the weights for the coordinates at which $\mathbf{x}_j$ differs from
$\mathbf{x}_i$, and it equals the sum of the weights for those coordinates. The larger this
sum is, the larger is the new distance, and the smaller is the overlap between the new
activation vectors. From these observations we conclude that $\mathbf{x}_j$ should increase
the weights for $\mathbf{x}_i$ at the places where it differs from $\mathbf{x}_i$, and that
the smaller the number of places at which it differs (that is, the closer the two
initially are), the larger the increase should be.

4. WEIGHTS BASED ON THE RETRIEVAL CUES 

If the distribution of the hard addresses is approximately uniform, a good set of 
weights can be derived from the retrieval cues alone. The heuristic algorithm that I 
have used for calculating the weights is as follows: Each $\mathbf{x}_j$ contributes to
$\mathbf{w}_i$ an incremental weight vector
$\mathbf{u}_j = (\mathbf{x}_j \neq \mathbf{x}_i)/h(\mathbf{x}_i, \mathbf{x}_j)$, where
$h(\cdot,\cdot)$ is the Hamming distance; and $\mathbf{w}_i$ is the (normalized) sum of the
incremental vectors over all $j \neq i$. Using the reciprocal of the Hamming distance in the
increments is based on a simple probabilistic argument, and experiments showed that the
reciprocal was better than either its square or square root.
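
The heuristic can be sketched in Python as below; the function names and the three toy cues are assumptions of the example. Each other cue adds the reciprocal of its Hamming distance at every coordinate where it differs, so every cue contributes a total increment of 1, concentrated on few coordinates when the cue is close.

```python
def hamming(a, b):
    """Number of coordinates at which a and b differ."""
    return sum(p != q for p, q in zip(a, b))


def weights_from_cues(cues, i):
    """Weight vector for cue i: normalized sum over all j != i of
    (x_j differs from x_i) / h(x_i, x_j)."""
    x_i = cues[i]
    n = len(x_i)
    w = [0.0] * n
    for j, x_j in enumerate(cues):
        if j == i:
            continue
        h = hamming(x_i, x_j)       # cues are unique, so h >= 1
        for v in range(n):
            if x_j[v] != x_i[v]:
                w[v] += 1.0 / h
    total = sum(w)
    return [n * wv / total for wv in w]   # normalize to sum n


# Hypothetical cues of n = 4 bits.
cues = [
    [1, 1, 1, 1],
    [1, 1, 1, -1],      # 1 bit away from cue 0
    [-1, -1, 1, 1],     # 2 bits away from cue 0
]
w0 = weights_from_cues(cues, 0)
print(w0)
```

The bit that alone separates cue 0 from its nearest neighbor receives the largest weight, and a bit on which no other cue differs receives none.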

5. WEIGHTS BASED ON THE OUTPUT SUMS 

A more general method of weighting is obtained by looking at the memory’s output.
First, we store all $t$ retrieval cues (or, rather, the corresponding vectors of 0s and
1s) “autoassociatively” in the standard manner, meaning that the data are of the form
$(\mathbf{x}_i, \mathbf{y}_i)$, where $y_{iv} = 0$ if $x_{iv} = -1$, and 1 otherwise, and that
the unweighted distance is used in computing the activation vectors. To determine the
weights we then do a standard memory retrieval with $\mathbf{x}_i$ as the input, but instead
of taking the final thresholded output, we work with the $n$ sums
($\mathbf{s}_i = \mathbf{C}^T \mathbf{a}_i$). What can they tell us?

Since every point of the space of $n$-bit vectors is equivalent to every other point, we
can simplify the discussion by assuming that $\mathbf{x}_i = -(11\ldots1)$ (the vector of
-1s) and $\mathbf{y}_i$ is the zero-vector (XOR all addresses and data with $\mathbf{y}_i$,
and replace $s_{iv}$ by $N - s_{iv}$ if $y_{iv} = 1$, where $N$ is the total number of
patterns stored in the locations activated by $\mathbf{x}_i$). The places where
$\mathbf{x}_j$ differs from $\mathbf{x}_i$ will then be the 1s of $\mathbf{y}_j$, and they
occur in the sum vector (i.e., $\mathbf{y}_j$ occurs in it) multiplied by the size of the
overlap of the activation vectors $\mathbf{a}_i$ and $\mathbf{a}_j$. As discussed above, this
is what we want, so that the sum vector $\mathbf{s}_i$ should provide a good set of weights
for $\mathbf{x}_i$ when $\mathbf{x}_i = -(11\ldots1)$; when
$\mathbf{x}_i \neq -(11\ldots1)$, the sum vector is first transformed as shown above. It
also needs to be transformed further.

I performed a number of experiments to determine how to transform the sums into
the input weights and arrived at the function

$$w'_{iv} = N'^{\,(s_{iv}/N')^{E}}$$

where $w'_{iv}$ is the weight before normalization, $N'$ is the total number of patterns
other than $\mathbf{y}_i$ that have been stored in the locations activated by
$\mathbf{x}_i$ (it is the sum of the sizes of the overlaps of activation vector
$\mathbf{a}_i$ with all the other activation vectors $\mathbf{a}_j$; $N' = N - k^+$), and
$E > 0$ is a parameter ($E$ is the same for all patterns in any given experiment). Thus,
the unnormalized weights for $\mathbf{x}_i$ are between 1 and $N'$. A pleasing feature of
this solution is that it implicitly takes the distribution of the hard addresses into
account.
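
The conversion of output sums into weights can be sketched in Python under two stated assumptions: that the weight function has the form N' raised to the power (s/N')^E, which agrees with the stated range of 1 to N' for the unnormalized weights, and that N' is positive. The function names and the numbers are hypothetical.

```python
def transform_sums(s, y, N):
    """Map the sums to the case where y_i is the zero-vector:
    replace s_v by N - s_v wherever y_v = 1."""
    return [N - sv if yv == 1 else sv for sv, yv in zip(s, y)]


def weights_from_sums(s, n_prime, E):
    """Unnormalized weights N' ** ((s_v / N') ** E) (assumed form) and
    the same weights normalized to add up to n. Assumes n_prime > 0."""
    raw = [n_prime ** ((sv / n_prime) ** E) for sv in s]
    n, total = len(raw), sum(raw)
    return raw, [n * rv / total for rv in raw]


# Hypothetical retrieval: k+ = 2 active locations holding N = 6 patterns,
# so N' = N - k+ = 4 patterns other than the cue's own.
y_i = [1, 0, 1, 0]
sums = [6, 2, 2, 4]
N, k_plus = 6, 2

s = transform_sums(sums, y_i, N)
raw, w = weights_from_sums(s, N - k_plus, E=1.0)
print(s, raw, w)
```

After the transformation each sum counts the stored patterns that differ from the cue at that bit, so a bit on which no neighbor differs gets the minimum weight 1 and a bit on which all of them differ gets the maximum weight N'.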


6. EXPERIMENTS 

Data. I experimented with three sets of data. In the first set a certain bit is particu- 
larly significant in discriminating between patterns of the set, the second set has a 
tight cluster of patterns in one part of the space (four highly significant bits), and the 
third set is “natural” bit patterns for the capital letters. 

The first data set has 32 31-bit patterns ($t = 32$, $n = 31$) arranged symmetrically so
that the distances from any one pattern to the others are 1, 2, 3, ..., 31 bits. The 31
bits are divided into five fields of 1, 2, 4, 8, and 16 bits, and each field is set to all 0s
or all 1s according to the bits of the binary numbers from 0 to 31. The first five patterns
thus are (0)(00)(0000)(00000000)(0000000000000000), (1)(00)(0000)(8 + 16 0s),
(0)(11)(0000)(24 0s), (1)(11)(0000)(24 0s), and (0)(00)(1111)(24 0s) (the parentheses show
grouping into fields). In this data set, each bit is 0 or 1 with equal probability, but the
first bit is more important than any other in discriminating between patterns, and bits 2
and 3 are very important, whereas none of the last 16 bits is particularly important.
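
The construction of the first data set can be reproduced with a short Python sketch (the function names are assumptions). It also confirms the stated symmetry: because field number p has width 2^p, the Hamming distance between two patterns equals the XOR of their numbers, so the distances from any pattern to the others are exactly 1 through 31.

```python
FIELD_WIDTHS = [1, 2, 4, 8, 16]     # five fields, 31 bits in all


def data_set_1():
    """32 patterns of 31 bits: field p of pattern number q is all 1s
    if bit p of q (least significant first) is 1, else all 0s."""
    patterns = []
    for number in range(32):
        bits = []
        for position, width in enumerate(FIELD_WIDTHS):
            bit = (number >> position) & 1
            bits.extend([bit] * width)
        patterns.append(bits)
    return patterns


def hamming(a, b):
    """Number of coordinates at which a and b differ."""
    return sum(p != q for p, q in zip(a, b))


patterns = data_set_1()
distances_from_first = sorted(hamming(patterns[0], p) for p in patterns[1:])
print(distances_from_first)
```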

The second data set has 28 36-bit patterns, which are organized in fields of length
1, 1, 1, 1, 4, 4, 12, and 12 bits. The first 16 patterns consist of all 16 binary
combinations in bits 1-4, with the rest of each pattern being 0s. In patterns 17-22, bits
1-4 act as a 4-bit field. These patterns, in hex, are 0F0000000, FF0000000, 00F000000,
F0F000000, 0FF000000, and FFF000000. In patterns 23-28, bits 1-12 act as a 12-bit
field. These last six patterns are 000FFF000, FFFFFF000, 000000FFF, FFF000FFF,
000FFFFFF, and FFFFFFFFF. In this data set, the first four bits are highly significant:
They define a tight cluster of 16 patterns about the origin. Bits 5-12 are also significant
and, together with bits 1-4, define a loose cluster about the origin. The mean distance
between the patterns of this data set is 11.3 bits.
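
As a check on the description, the second data set can be rebuilt from the hex strings in a few lines of Python (the helper names are assumptions of this sketch), and the mean pairwise Hamming distance can be computed directly.

```python
from itertools import combinations


def hex_to_bits(h):
    """Expand a hex string into a list of bits, most significant first."""
    value = int(h, 16)
    width = 4 * len(h)
    return [(value >> (width - 1 - v)) & 1 for v in range(width)]


def hamming(a, b):
    """Number of coordinates at which a and b differ."""
    return sum(p != q for p, q in zip(a, b))


def data_set_2():
    """The 28 patterns of 36 bits, as 9-digit hex strings."""
    cluster = ["%X00000000" % c for c in range(16)]   # patterns 1-16
    four_bit = ["0F0000000", "FF0000000", "00F000000",
                "F0F000000", "0FF000000", "FFF000000"]
    twelve_bit = ["000FFF000", "FFFFFF000", "000000FFF",
                  "FFF000FFF", "000FFFFFF", "FFFFFFFFF"]
    return [hex_to_bits(h) for h in cluster + four_bit + twelve_bit]


patterns = data_set_2()
pairs = list(combinations(patterns, 2))
mean_distance = sum(hamming(a, b) for a, b in pairs) / len(pairs)
print(round(mean_distance, 1))
```

The computed mean rounds to the 11.3 bits quoted in the text.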

The third data set has 26 35-bit patterns from 7x5 raster images of the capital let- 
ters, given here in hex (the 1st 35 bits each of): 2114AFC62, F463E8C7C, 74610845C, 
F46318C7C, FC21E843E, FC21E8420, 74610BC5E, 8C63F8C62, 71084211C, 388421498, 
8CA98A4A2, 84210843E, 8EEB58C62, 8E6B39C62, 746318C5C, F463E8420, 74631ACDE, 
F463EA4A2, 7460E0C5C, F90842108, 8C6318C5C, 8C62A5108, 8C635AED4, 8C5445462, 
8C5442108, and F8444443E. The closest pair of patterns in this set (D and O) is sepa- 
rated by two bits, and five patterns are three bits away and seven are four bits away 
from the closest other pattern. The mean distance between patterns is 14.1 bits. 
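
The hex strings can be decoded with a short Python sketch. It assumes that the 35 bits are the first 35 of the 36 bits in each string and that they fill the 7x5 raster row by row; the helper names are also assumptions. Under these assumptions the fourth and fifteenth strings draw the letters D and O, and they differ in two bits, as stated.

```python
def hex_to_bits(h, n_bits):
    """First n_bits bits of a hex string, most significant first."""
    value = int(h, 16)
    width = 4 * len(h)
    return [(value >> (width - 1 - v)) & 1 for v in range(n_bits)]


def raster(bits, columns=5):
    """Render the bits row by row, '#' for 1 and '.' for 0."""
    return ["".join("#" if b else "." for b in bits[start:start + columns])
            for start in range(0, len(bits), columns)]


D = hex_to_bits("F46318C7C", 35)
O = hex_to_bits("746318C5C", 35)
distance = sum(p != q for p, q in zip(D, O))

print("\n".join(raster(D)))
print()
print("\n".join(raster(O)))
print("distance:", distance)
```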

Procedure. The different methods of weighting the retrieval cues were compared 
in a series of experiments. An experiment consisted of four passes. In each pass, the 
data set was written once into the memory (the contents C were first set to 0s). The
addressing of the memory for the first three passes was as follows: (1) uniform weight- 
ing of the retrieval cues; (2) using weights based on the output sums of the first pass 
(see Sec. 5); and (3) using weights based on the retrieval cues (see Sec. 4). The fourth 
pass was for control: (4) a uniform random data set, of the same size as in the first 
three passes, was stored with uniform weighting. 

All experiments were made with uniform random sparse distributed memories with 
300 locations (the address matrix A was set randomly to -1s and 1s with equal
probability; m = 300). In Pass 1, the minimum number of locations activated by a retrieval
cue, k, was 15; in the other passes, k was adjusted so as to give an average k+ as close
to that for the first pass as possible, so that the memory would hold nearly equal num- 
bers of patterns in the different passes. Since the formula for the weights based on the 
output sums contains a parameter, E, Pass 2 required a number of iterations for find- 
ing the optimal exponent E (optimal was defined as the value that minimizes the 
mean [root mean square] overlap over all pairs of activation vectors for the data set). 
Each experiment was run ten times on each of the three data sets. 

Results. The results are summarized in Tables 1-3. Entries of the form x(y)z give 
the smallest, the mean, and the largest of the ten values obtained. In the headers, t is 
the size of the data set, n is the pattern length, mean k+ is the average number of
locations activated by a retrieval cue, and E is the exponent used in the weight equation in
Pass 2. 

The row labeled Empty cells gives the number of hard locations that were activated
by none of the retrieval cues; they are wasted. Max. pats./cell is the number of times
that the “busiest” location was activated and written into. Very busy locations become
noise and thus also are wasted. Var. pats./cell is the variance of the number of
activations per location (i.e., the number of patterns stored per location). A small
variance means uniform utilization of memory (writing into the locations at random results
in a variance equal to the mean). Mean % overlap is the average (root mean square)
overlap over all pairs of activation vectors for the data set, and Max. % overlap is the
largest of them. Small percentages mean good discrimination by the memory.

TABLE 1
Experiment with Data Set 1 (t = 32, n = 31, mean k+ = 20.1 (21.5) 22.7, E = 1.9 (2.08) 2.4)

                  Pass 1              Pass 2              Pass 3              Pass 4
                  Uniform weights     Weights from sums   Weights from data   Random data
Empty cells       112 (129) 142       74 (86.3) 97        71 (84.1) 93        18 (23.9) 31
Max. pats./cell   12 (14.1) 16        7 (8.0) 9           8 (8.3) 10          6 (7.7) 10
Var. pats./cell   7.3 (8.15) 9.3      3.23 (3.84) 4.53    3.47 (3.87) 4.32    1.85 (2.08) 2.39
Mean % overlap    25.0 (26.5) 29.1    14.7 (15.8) 16.6    14.9 (15.7) 16.4    10.1 (10.4) 11.0
Max. % overlap    100 (100) 100       50 (55.2) 59        45 (54.4) 59        36 (45.3) 59

TABLE 2
Experiment with Data Set 2 (t = 28, n = 36, mean k+ = 19.1 (21.2) 23.3, E = 0.20 (0.25) 0.32)

                  Pass 1              Pass 2              Pass 3              Pass 4
                  Uniform weights     Weights from sums   Weights from data   Random data
Empty cells       120 (144) 159       0 (6.1) 11          0 (1.5) 5           31 (38.6) 49
Max. pats./cell   19 (21.2) 22        5 (5.9) 7           5 (6.7) 9           6 (6.7) 8
Var. pats./cell   13.0 (15.1) 16.8    1.25 (1.48) 1.77    1.16 (1.54) 1.81    1.65 (2.02) 2.48
Mean % overlap    36.9 (42.1) 46.3    10.0 (10.9) 11.6    9.64 (11.2) 12.1    9.73 (10.6) 11.8
Max. % overlap    89 (98.3) 100       43 (56.6) 74        42 (53.4) 65        36 (44.5) 65

TABLE 3
Experiment with Data Set 3 (t = 26, n = 35, mean k+ = 20.9 (21.8) 23.3, E = 0.26 (0.37) 0.50)

                  Pass 1              Pass 2              Pass 3              Pass 4
                  Uniform weights     Weights from sums   Weights from data   Random data
Empty cells       89 (103) 111        27 (35.7) 43        33 (41.4) 52        29 (40.9) 46
Max. pats./cell   10 (13.5) 17        5 (5.6) 7           5 (5.8) 7           5 (6.7) 11
Var. pats./cell   4.29 (5.24) 6.08    1.31 (1.47) 1.60    1.38 (1.62) 1.86    1.54 (1.78) 2.32
Mean % overlap    18.6 (20.7) 22.8    9.4 (9.94) 10.4     10.5 (11.0) 11.7    9.69 (10.4) 11.2
Max. % overlap    73 (80.6) 93        32 (37.9) 48        38 (47.5) 67        32 (42.1) 63
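
The per-location statistics reported in the tables can be computed from a set of activation vectors as in the following Python sketch. The function name and the toy vectors are hypothetical, and the sketch assumes the population variance (dividing by the number of locations), which the text does not specify.

```python
def cell_statistics(activations):
    """Empty cells, busiest cell, mean and (population) variance of the
    number of patterns written into each location."""
    m = len(activations[0])
    counts = [sum(a[u] for a in activations) for u in range(m)]
    mean = sum(counts) / m
    variance = sum((c - mean) ** 2 for c in counts) / m
    return {
        "empty": counts.count(0),
        "max": max(counts),
        "mean": mean,
        "variance": variance,
    }


# Three hypothetical activation vectors over m = 5 locations.
activations = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 1, 0, 0],
]
stats = cell_statistics(activations)
print(stats)
```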

I did these same experiments without clearing the memory between the first and 
the second pass — before storing the data with the weighted cues. After the data were 
thus stored twice, unweighted retrieval produced nearly the same set of weights for 
the weighted retrieval as it did with the data stored only unweighted. 
7. CONCLUSIONS

Storing a set of input patterns autoassociatively in a sparse distributed memory pro- 
vides a means of finding a set of input weights for these same patterns: the sums 
obtained by reading from the memory can be converted into the weights. Such weight- 
ing of the input patterns improves the utilization of the memory if the set of patterns 
does not match well the set of memory locations. No assumptions are made concerning 
the distribution of the memory locations, so that this method can be used in addition 
to any other method, such as redistributing the memory locations, to improve memory 
utilization. The memory can be used at once for both the unweighted and the weighted 
storing of the patterns, which points a way to a two-step method of storing a set of pat- 
terns in and retrieving it from the memory. The weighting is then like an attentional 
mechanism: the weights from the first pass indicate where the discriminating features 
of the pattern are in relation to this particular set of data and this particular set of 
memory locations. 